Thesis
- Ranking rules: consensus
- Scoring ranking rules
- Plurality
- k-rules
- Borda count
- Distances
- Algorithms
- Datasets
- Examples
Ranking rules: consensus
Scoring ranking rules
Plurality
k-rules
Borda count
Distances
Categorical features
Numerical features
Mixed features
Algorithms
dknn
Sets of distances
rknn
Datasets
List
- Binary
  - Categorical attributes
    - Less than 10 attributes
    - 10 or more attributes
  - Mixed: categorical and numerical attributes
    - Less than 10 attributes
    - 10 or more attributes
  - Numeric attributes
    - Less than 10 attributes
    - 10 or more attributes
  - Categorical attributes
- Multiclass
  - Categorical attributes
    - Less than 10 attributes
    - 10 or more attributes
  - Mixed: categorical and numerical attributes
    - Less than 10 attributes
    - 10 or more attributes
  - Numeric attributes
    - Less than 10 attributes
    - 10 or more attributes
  - Categorical attributes
Binary
Categorical attributes
Less than 10 attributes
Breast Cancer
This is one of three datasets provided by the Oncology Institute that has repeatedly appeared in the machine learning literature.
This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes. In this version of the dataset, all the attributes are nominal.
- Source: UCI Machine Learning Repository
- Number of rows: 277
- Number of attributes: 9
Description of the attributes:
Cars
The Car Evaluation Database was derived from a simple hierarchical decision model. The attributes are: buying (buying price), maint (price of the maintenance), doors (number of doors), persons (capacity in terms of persons to carry), lug_boot (the size of the luggage boot), safety (estimated safety of the car), and class. The class is the car acceptability, and its possible values are unacc, acc, good, and vgood.
- Source: UCI Machine Learning Repository
- Number of rows: 1728
- Number of attributes: 6
Description of the attributes:
Somerville
The skin dataset was collected by randomly sampling B, G, R values from face images of various age groups (young, middle, and old), race groups (white, black, and asian), and genders, obtained from the Color FERET Image Database and the PAL Face Database from the Productive Aging Laboratory, The University of Texas at Dallas. The total learning sample size is 245057, of which 50859 are skin samples and 194198 are non-skin samples. The dataset has dimension 245057 * 4, where the first three columns are the B, G, R values (features x1, x2, and x3) and the fourth column holds the class labels (decision variable y).
- Source: UCI Machine Learning Repository
- Number of rows: 143
- Number of attributes: 6
Description of the attributes:
Tic-Tac-Toe
This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where “x” is assumed to have played first. The target concept is “win for x” (i.e., true when “x” has one of 8 possible ways to create a “three-in-a-row”).
- Source: UCI Machine Learning Repository
- Number of rows: 958
- Number of attributes: 9
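The target concept described above can be illustrated with a minimal Python sketch. The board encoding used here (a 9-character string in row-major order, with 'x', 'o', and 'b' for blank) and the function name `win_for_x` are assumptions made for illustration, not part of the thesis code.

```python
# The 8 possible three-in-a-row lines on a 3x3 board,
# written as row-major cell indices (0..8).
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def win_for_x(board):
    """Return True when 'x' occupies all three cells of at least one line.

    `board` is a hypothetical encoding: 9 cells in row-major order,
    each one of 'x', 'o', or 'b' (blank).
    """
    return any(all(board[i] == "x" for i in line) for line in WIN_LINES)


print(win_for_x("xxxoobbbb"))  # top row is all 'x'
print(win_for_x("xoxxoooxx"))  # full board with no line for 'x'
```

The first board is a win for "x" through the top row; the second is a full board in which neither player completes a line.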
Description of the attributes:
10 or more attributes
Mixed: categorical and numerical attributes
Less than 10 attributes
Cesarean
Mammographic masses
10 or more attributes
Travel insurance
- Source: Kaggle
- Number of rows: 18219
- Number of attributes: 10
Description of the attributes:
Numeric attributes
Less than 10 attributes
Banknote authentication
- Source: UCI Machine Learning Repository
- Number of rows: 1372
- Number of attributes: 4
Data were extracted from images that were taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.
Description of the attributes:
- variance of Wavelet Transformed image (continuous)
- skewness of Wavelet Transformed image (continuous)
- kurtosis of Wavelet Transformed image (continuous)
- entropy of image (continuous)
- class
Haberman
- Source: UCI Machine Learning Repository
- Number of rows: 306
- Number of attributes: 3
The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Description of the attributes:
- Age of patient at time of operation (numerical)
- Patient’s year of operation (year - 1900, numerical)
- Number of positive axillary nodes detected (numerical)
- Survival status (class attribute)
  - 1 = the patient survived 5 years or longer
  - 2 = the patient died within 5 years
Skin segmentation
10 or more attributes
Multiclass
Categorical attributes
Less than 10 attributes
Balance Scale
This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The class is found by comparing \((left\_distance * left\_weight)\) with \((right\_distance * right\_weight)\): the scale tips toward the side with the greater product. If the products are equal, it is balanced.
- Source: UCI Machine Learning Repository
- Number of rows: 625
- Number of attributes: 4
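The classification rule for this dataset compares the two products of distance and weight. A minimal Python sketch of that rule follows; the function name `balance_class` and the label strings "left", "right", and "balanced" are hypothetical and chosen only for readability.

```python
def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Classify a balance-scale example by comparing the two products.

    The returned label strings are hypothetical, not the dataset's own
    class codes.
    """
    left = left_distance * left_weight
    right = right_distance * right_weight
    if left > right:
        return "left"       # scale tips to the left
    if right > left:
        return "right"      # scale tips to the right
    return "balanced"       # equal products


print(balance_class(2, 3, 1, 4))  # 6 vs 4
print(balance_class(1, 2, 3, 3))  # 2 vs 9
print(balance_class(2, 2, 4, 1))  # 4 vs 4
```

With the three calls shown, the products are 6 against 4, 2 against 9, and 4 against 4, so the results are "left", "right", and "balanced" respectively.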
Description of the attributes:
Chess
Post operative data
10 or more attributes
Poker hand
Mixed: categorical and numerical attributes
Less than 10 attributes
Teaching-assistant
- Source: UCI Machine Learning Repository
Abalone
Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.
From the original data, examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).
- Source: UCI Machine Learning Repository
- Number of rows: 4177
- Number of attributes: 8
Description of the attributes:
Life expectancy
10 or more attributes
Numeric attributes
Less than 10 attributes
Life expectancy
Seeds
10 or more attributes
Notes
Changes to the original datasets if any:
Examples
The mini_iris dataset
Manhattan distance with mini_iris
Euclidean distance with mini_iris
Matrix of distances:
| test | X1 | X2 | X3 | X5 | X6 | X7 | X9 | X10 | X11 |
|---|---|---|---|---|---|---|---|---|---|
| X4 | 2.144761 | 1.627882 | 0.7141428 | 4.5607017 | 3.9648455 | 3.8923001 | 4.160529 | 6.522270 | 4.2731721 |
| X8 | 3.491418 | 3.295451 | 3.6687873 | 0.9643651 | 0.4358899 | 0.3464102 | 1.104536 | 2.917190 | 0.7937254 |
| X12 | 4.713809 | 4.570558 | 4.9909919 | 0.8831761 | 1.2083046 | 1.2369317 | 1.195826 | 1.838478 | 0.8366600 |
Ranking for each instance:
| test / train | X1 | X2 | X3 | X5 | X6 | X7 | X9 | X10 | X11 |
|---|---|---|---|---|---|---|---|---|---|
| X4 | 3 | 2 | 1 | 8 | 5 | 4 | 6 | 9 | 7 |
| X8 | 8 | 7 | 9 | 4 | 2 | 1 | 5 | 6 | 3 |
| X12 | 8 | 7 | 9 | 2 | 4 | 5 | 3 | 6 | 1 |
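The ranking table can be reproduced from the matrix of distances by ranking each row from the nearest training instance (rank 1) to the farthest (rank 9). The following Python sketch shows one way to do this; it is an illustration rather than the thesis implementation, and the names `distances` and `rank_row` are hypothetical. The distance values are copied from the table above.

```python
# Training instance ids, in the column order of the distance matrix.
train_ids = ["X1", "X2", "X3", "X5", "X6", "X7", "X9", "X10", "X11"]

# Euclidean distances from each test instance to every training instance.
distances = {
    "X4": [2.144761, 1.627882, 0.7141428, 4.5607017, 3.9648455,
           3.8923001, 4.160529, 6.522270, 4.2731721],
    "X8": [3.491418, 3.295451, 3.6687873, 0.9643651, 0.4358899,
           0.3464102, 1.104536, 2.917190, 0.7937254],
    "X12": [4.713809, 4.570558, 4.9909919, 0.8831761, 1.2083046,
            1.2369317, 1.195826, 1.838478, 0.8366600],
}


def rank_row(row):
    """Return the rank (1 = nearest) of each entry in a row of distances.

    Ties would be broken by column order because sorted() is stable;
    this is an assumption of the sketch, not a documented rule.
    """
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0] * len(row)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


for test_id, row in distances.items():
    print(test_id, dict(zip(train_ids, rank_row(row))))
```

For X4, for example, the smallest distance is to X3, so X3 receives rank 1, followed by X2 and X1.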
Train dknn k = 3, distance = euclidean, ties = randomly
nrow(train) = 9 and nrow(test) = 3
Predict... Choosing a label [method = randomly , k = 3] for the instance X4:
setosa setosa setosa versicolor versicolor versicolor virginica virginica virginica
3 2 1 8 5 4 6 9 7
setosa > setosa > setosa > versicolor > versicolor > virginica > virginica > versicolor > virginica
--> Selected values:
setosa setosa setosa
--> Probabilities:
setosa versicolor virginica
1 0 0
The label for this instance is: setosa
Predict... Choosing a label [method = randomly , k = 3 ] for the instance with ranking:
setosa setosa setosa versicolor versicolor versicolor virginica virginica virginica
8 7 9 4 2 1 5 6 3
versicolor > versicolor > virginica > versicolor > virginica > virginica > setosa > setosa > setosa
--> Selected values:
versicolor versicolor virginica
--> Probabilities:
setosa versicolor virginica
0.0000000 0.6666667 0.3333333
The label for this instance is: versicolor
Predict... Choosing a label [method = randomly , k = 3 ] for the instance with ranking:
setosa setosa setosa versicolor versicolor versicolor virginica virginica virginica
8 7 9 2 4 5 3 6 1
virginica > versicolor > virginica > versicolor > versicolor > virginica > setosa > setosa > setosa
--> Sure values:
versicolor virginica
2 1
--> Tied values:
virginica
3
Solving the ties... randomly
--> Number of elements to select randomly: 1
--> Selected values:
[1] versicolor virginica virginica
Levels: setosa versicolor virginica
times
setosa versicolor virginica
0.0000000 0.3333333 0.6666667
The label for this instance is: virginica
[1] "setosa" "versicolor" "virginica"
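The label selection shown in the log above can be approximated with a short Python sketch: take the labels of the k best-ranked training instances and use their relative frequencies as class probabilities. This is a simplified illustration, not the dknn implementation; the function name `knn_vote` is hypothetical, and vote ties are broken by label order here instead of randomly.

```python
from collections import Counter

# Labels of the nine training instances, in column order (X1 ... X11).
labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3

# Rankings of the training instances for each test instance (1 = nearest).
rankings = {
    "X4": [3, 2, 1, 8, 5, 4, 6, 9, 7],
    "X8": [8, 7, 9, 4, 2, 1, 5, 6, 3],
    "X12": [8, 7, 9, 2, 4, 5, 3, 6, 1],
}


def knn_vote(ranks, labels, k):
    """Return (selected labels, class probabilities, predicted label).

    Assumption of this sketch: if two classes get the same number of
    votes, the first one in sorted label order wins (no random choice).
    """
    nearest = [label for _, label in sorted(zip(ranks, labels))[:k]]
    counts = Counter(nearest)
    classes = sorted(set(labels))
    probs = {c: counts.get(c, 0) / k for c in classes}
    predicted = max(classes, key=lambda c: probs[c])
    return nearest, probs, predicted


for test_id, ranks in rankings.items():
    selected, probs, predicted = knn_vote(ranks, labels, k=3)
    print(test_id, selected, probs, predicted)
```

With k = 3, the sketch selects setosa for X4, versicolor for X8, and virginica for X12, which agrees with the labels reported in the log.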
> sink("iris_manhattan_randomly")